Practical - Week 1
2025-09-29
Data that can place a particular taxon in a particular location and time can take many forms:
Presence-only (PO) data
| PROS | CONS |
|---|---|
| huge amounts of data available, easily aggregated | often without details of effort/method, wide variation in data quality |
Presence-absence (PA) data
| PROS | CONS |
|---|---|
| absences are informative, area and effort are measured | less abundant (too time-consuming), methods are species-specific |
Repeated surveys
| PROS | CONS |
|---|---|
| standardised protocols, multiple points in time | expensive, geographically restricted, usually temporally too |
Range-maps
| PROS | CONS |
|---|---|
| rough estimates of the outer boundaries of areas within which species are likely to occur | large spatial and temporal uncertainties |
Data can also be defined by how they were collected.
Structured
Semi-structured
Unstructured (opportunistic)
Finally, data can also be defined by how they are made available for others.
Disaggregated
Aggregated
GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.
OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.
eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.
iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.
Observation.org is a global biodiversity platform for citizen science and monitoring, established in 2004. It is mainly used in Europe.
IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.
rredlist: https://github.com/ropensci/rredlist
Map of Life assembles and integrates different sources of data describing species distributions worldwide. It is developed by the Center for Biodiversity and Global Change at Yale University.
Chorological maps for the main European woody species is a data paper with a dataset of chorological maps for the main European tree and shrub species, put together by Giovanni Caudullo, Erik Welk, and Jesús San-Miguel-Ayanz.
BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly-located 1-km sites. It’s part of the NBN Atlas.
BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.
SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.
sibbr: https://github.com/sibbr
BioTime is an open access global database of assemblage time series for quantifying and understanding biodiversity change.
BioTime Hub: https://github.com/bioTIMEHub
Open means anyone can freely access, use, modify, and share for any purpose.
Public doesn’t mean open
The data on the internet can be public but they are not necessarily open. They can be standard, available in open formats (e.g., csv), and yet, if they don’t have a licence, by default they are closed (all rights reserved).
Open data are licensed under open licences. Some examples:
CC0: Public domain
CC-BY: Attribution
CC-BY-NC: Attribution - Non Commercial
CC-BY-SA: Attribution - Share Alike
Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.
countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code.
recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
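To make the two terms above concrete, here is a minimal sketch in base R. The record values are made up for illustration, and the " | " separator is an assumption based on Darwin Core's recommended practice for concatenated lists; real datasets may use other separators.

```r
# One toy record using two Darwin Core terms (values are invented)
record <- data.frame(
  countryCode = "CZ",
  recordedBy = "Jane Doe | John Smith",
  stringsAsFactors = FALSE
)

# countryCode should be a two-letter, upper-case ISO 3166-1 alpha-2 code
valid_code <- grepl("^[A-Z]{2}$", record$countryCode)

# recordedBy is a concatenated list; split it on the " | " separator
recorders <- strsplit(record$recordedBy, " | ", fixed = TRUE)[[1]]

valid_code
recorders
```

Because every publisher uses the same term names, the same two checks can be applied to any dataset that follows the standard.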
Data that are standardized and have an open licence can be shared :)
Choose one taxon and one data source, and try to get distribution data.
Then answer the following 3 questions:
We will use the mammals of the Czech Republic as an example dataset. We will access the data through GBIF using R.
Create a project (with code and data folders inside): File > New project > New directory or Existing directory
We will always load packages into R using the package pacman.
If you attempt to load a library that is not installed, pacman will try to install it automatically.
We will use tidyverse for the manipulation and transformation of data.
We will be using many functions from this collection of packages, like filter(), mutate(), and later read_csv().
We will use rgbif to download data from GBIF directly into our R session.
We will need to get a taxon ID (taxonKey) for the Mammalia class from the GBIF backbone. For that, we will use another package called taxize.
We will use sf to work with spatial data.
We will use rnaturalearth to interact with Natural Earth and get mapping data (e.g., countries’ polygons) into R.
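The package loading described above could look like the following sketch. It assumes pacman itself may not be installed yet, so it installs it first if needed; the package list is simply the five packages named in this practical.

```r
# Install pacman once if it is missing
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# p_load() loads each package and installs any that are missing
pacman::p_load(tidyverse, rgbif, taxize, sf, rnaturalearth)
```

Depending on the rnaturalearth version, the country polygons may live in the companion package rnaturalearthdata, which you may be prompted to install separately.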
Create some variables that will be used later.
Define the things you already know you will use later in the script. For instance, we know that we will work with mammals from the Czech Republic and that the data we will get from GBIF are in WGS84 latitude and longitude.
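A minimal sketch of these variables is shown here. The names taxon and country_code are the ones used later in this practical; proj_crs is a hypothetical name chosen for illustration.

```r
taxon <- "Mammalia"    # class name to look up in the GBIF backbone
country_code <- "CZ"   # ISO 3166-1 alpha-2 code of the Czech Republic
proj_crs <- 4326       # EPSG code for WGS84 latitude/longitude (hypothetical name)
```

Defining these once at the top means that switching to another taxon or country only requires changing one line.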
Get a taxon ID for the Mammalia class.
taxon_key <- get_gbifid_(taxon) %>%
  bind_rows() %>% # Transform the result of get_gbifid into a dataframe
  filter(matchtype == "HIGHERRANK" & status == "ACCEPTED") %>% # Filter the dataframe by the columns "matchtype" and "status"
  pull(usagekey) # Pull the contents of the column "usagekey"
taxon_key
[1] 359
Basemap of CZ to use later for plotting or checking the dataset.
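One possible way to get such a basemap is sketched here. The helper name get_country_outline is hypothetical, and the sketch assumes that the Natural Earth attribute table has an iso_a2 column holding "CZ" for the Czech Republic. Filtering on the ISO code avoids depending on whether the data name the country "Czechia" or "Czech Republic".

```r
# Hypothetical helper: return the outline of one country as an sf object.
# Nothing is loaded or downloaded until the function is called.
get_country_outline <- function(iso2, scale = "medium") {
  world <- rnaturalearth::ne_countries(scale = scale, returnclass = "sf")
  # %in% (rather than ==) avoids problems with missing ISO codes
  world[world$iso_a2 %in% iso2, ]
}

# Usage (needs rnaturalearth and its data package installed):
# CZ_borders <- get_country_outline("CZ")
# plot(sf::st_geometry(CZ_borders))
```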
And now we can use the function occ_count() to find out the number of occurrence records for the entire Czech Republic.
How many occurrence records are in GBIF for the entire Czech Republic?
And how many of those records are mammals?
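Both questions can be answered with occ_count(), as in this sketch. The wrapper name count_gbif_records is hypothetical, and the sketch assumes a recent rgbif version in which occ_count() accepts the same country and taxonKey arguments as occ_search().

```r
# Hypothetical wrapper around rgbif::occ_count()
count_gbif_records <- function(country_code, taxon_key = NULL) {
  if (is.null(taxon_key)) {
    # All occurrence records for the country
    rgbif::occ_count(country = country_code)
  } else {
    # Only records of the given taxon within the country
    rgbif::occ_count(country = country_code, taxonKey = taxon_key)
  }
}

# Usage (needs an internet connection):
# count_gbif_records("CZ")        # all records for the Czech Republic
# count_gbif_records("CZ", 359)   # only mammals (taxon key 359)
```

The numbers change as GBIF indexes new datasets, so your counts may differ from the ones shown in this practical.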
After this initial exploration, we are ready to download data. Whoop!
To do this, we will use occ_search(). This function has many options that correspond to fields in the GBIF database (Darwin Core terms).
occ_search(
taxonKey = NULL,
scientificName = NULL,
country = NULL,
publishingCountry = NULL,
hasCoordinate = NULL,
typeStatus = NULL,
recordNumber = NULL,
lastInterpreted = NULL,
continent = NULL,
geometry = NULL,
geom_big = "asis",
geom_size = 40,
geom_n = 10,
recordedBy = NULL,
recordedByID = NULL,
identifiedByID = NULL,
basisOfRecord = NULL,
datasetKey = NULL,
eventDate = NULL,
catalogNumber = NULL,
year = NULL,
month = NULL,
decimalLatitude = NULL,
decimalLongitude = NULL,
elevation = NULL,
depth = NULL,
institutionCode = NULL,
collectionCode = NULL,
hasGeospatialIssue = NULL,
issue = NULL,
search = NULL,
mediaType = NULL,
subgenusKey = NULL,
repatriated = NULL,
phylumKey = NULL,
kingdomKey = NULL,
classKey = NULL,
orderKey = NULL,
familyKey = NULL,
genusKey = NULL,
establishmentMeans = NULL,
protocol = NULL,
license = NULL,
organismId = NULL,
publishingOrg = NULL,
stateProvince = NULL,
waterBody = NULL,
locality = NULL,
limit = 500,
start = 0,
fields = "all",
return = NULL,
facet = NULL,
facetMincount = NULL,
facetMultiselect = NULL,
skip_validate = TRUE,
curlopts = list(),
...
)
This is not the best way to download data from GBIF. The best way would be to use the function occ_download(), but you will need a user account for this.
For our next practical, please create an account on GBIF.org and follow the script download_mammalsCZ_data_from_GBIF.R in the Week2_gridding_and plotting of the Practical_classes folder to get these data.
Get occurrence records of mammals from the Czech Republic.
Records found [15735]
Records returned [500]
No. unique hierarchies [43]
No. media records [500]
No. facets [0]
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
fields=all]
# A tibble: 500 × 104
key scientificName decimalLatitude decimalLongitude issues datasetKey
<chr> <chr> <dbl> <dbl> <chr> <chr>
1 5006879567 Ovis aries mus… 50.1 14.6 cdc,c… 50c9509d-…
2 5007205681 Myocastor coyp… 50.1 14.4 cdc,c… 50c9509d-…
3 5007283594 Myocastor coyp… 49.6 17.3 cdc,c… 50c9509d-…
4 5007542247 Capreolus capr… 48.9 14.4 cdc,c… 50c9509d-…
5 5007576064 Capreolus capr… 49.2 16.9 cdc,c… 50c9509d-…
6 5007738330 Rhinolophus hi… 49.4 16.7 cdc,c… 50c9509d-…
7 5007845561 Mustela nivali… 50.2 14.4 cdc,c… 50c9509d-…
8 5008416860 Capreolus capr… 50.0 14.5 cdc,c… 50c9509d-…
9 5036714112 Lepus europaeu… 50.0 14.4 cdc,c… 50c9509d-…
10 5036838815 Ovis aries mus… 50.1 14.6 cdc,c… 50c9509d-…
# ℹ 490 more rows
# ℹ 98 more variables: publishingOrgKey <chr>, installationKey <chr>,
# hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
# lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
# occurrenceStatus <chr>, lifeStage <chr>, taxonKey <int>, kingdomKey <int>,
# phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
# genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>, …
By default, it will only return the first 500 records.
To get all the records, we need to specify a larger limit. Since we have over 15,000 records, a limit of 16,000 is enough to cover them all. However, this will be slow, so you could pick 5,000 for a quicker test.
Finally, we store the result in the object mammalsCZ. We include the option to remove common geospatial issues (e.g., zero coordinates, country coordinate mismatch, invalid coordinate, etc.).
mammalsCZ <- occ_search(
  taxonKey = taxon_key, # Key 359 created previously
  country = country_code, # CZ, ISO code of the Czech Republic
  limit = 16000, # Max number of records to download
  hasGeospatialIssue = FALSE # Only records without spatial issues
)
mammalsCZ <- mammalsCZ$data # The output of occ_search is a list with a data object inside. Here we pull the data out of the list.
Mammals occurrence records from the Czech Republic
Examine the dataset’s variables and their respective data types: Are they numeric, character, or boolean in nature?
Rows: 15,682
Columns: 228
$ key <chr> "5006879567", "5007205681", "500…
$ scientificName <chr> "Ovis aries musimon (Pallas, 181…
$ decimalLatitude <dbl> 50.06399, 50.08033, 49.59528, 48…
$ decimalLongitude <dbl> 14.57059, 14.41240, 17.26240, 14…
$ issues <chr> "cdc,cdround", "cdc,cdround", "c…
$ datasetKey <chr> "50c9509d-22c7-4a22-a47d-8c48425…
$ publishingOrgKey <chr> "28eb1a3f-1c15-4a95-931a-4af90ec…
$ installationKey <chr> "997448a8-f762-11e1-a439-00145eb…
$ hostingOrganizationKey <chr> "28eb1a3f-1c15-4a95-931a-4af90ec…
$ publishingCountry <chr> "US", "US", "US", "US", "US", "U…
$ protocol <chr> "DWC_ARCHIVE", "DWC_ARCHIVE", "D…
$ lastCrawled <chr> "2025-09-22T23:56:44.581+00:00",…
$ lastParsed <chr> "2025-09-23T14:13:19.446+00:00",…
$ crawlId <int> 560, 560, 560, 560, 560, 560, 56…
$ basisOfRecord <chr> "HUMAN_OBSERVATION", "HUMAN_OBSE…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT",…
$ lifeStage <chr> "Adult", NA, NA, NA, NA, NA, "Ad…
$ taxonKey <int> 6165157, 4264680, 4264680, 52201…
$ kingdomKey <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ phylumKey <int> 44, 44, 44, 44, 44, 44, 44, 44, …
$ classKey <int> 359, 359, 359, 359, 359, 359, 35…
$ orderKey <int> 731, 1459, 1459, 731, 731, 734, …
$ familyKey <int> 9614, 3240572, 3240572, 5298, 52…
$ genusKey <int> 9531221, 3240573, 3240573, 24409…
$ speciesKey <int> 2441110, 4264680, 4264680, 52201…
$ acceptedTaxonKey <int> 6165157, 4264680, 4264680, 52201…
$ acceptedScientificName <chr> "Ovis aries musimon (Pallas, 181…
$ kingdom <chr> "Animalia", "Animalia", "Animali…
$ phylum <chr> "Chordata", "Chordata", "Chordat…
$ order <chr> "Artiodactyla", "Rodentia", "Rod…
$ family <chr> "Bovidae", "Myocastoridae", "Myo…
$ genus <chr> "Ovis", "Myocastor", "Myocastor"…
$ species <chr> "Ovis aries", "Myocastor coypus"…
$ genericName <chr> "Ovis", "Myocastor", "Myocastor"…
$ specificEpithet <chr> "aries", "coypus", "coypus", "ca…
$ infraspecificEpithet <chr> "musimon", NA, NA, NA, NA, NA, N…
$ taxonRank <chr> "SUBSPECIES", "SPECIES", "SPECIE…
$ taxonomicStatus <chr> "ACCEPTED", "ACCEPTED", "ACCEPTE…
$ dateIdentified <chr> "2025-01-04T21:37:05", "2025-01-…
$ coordinateUncertaintyInMeters <dbl> 15, 8, 8, 31, 4038, 26550, 13, 3…
$ continent <chr> "EUROPE", "EUROPE", "EUROPE", "E…
$ stateProvince <chr> "Prague", "Prague", "Olomoucký",…
$ year <int> 2025, 2025, 2025, 2025, 2025, 20…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ eventDate <chr> "2025-01-04T12:09", "2025-01-05T…
$ startDayOfYear <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ endDayOfYear <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ modified <chr> "2025-01-05T22:58:50.000+00:00",…
$ lastInterpreted <chr> "2025-09-23T14:13:19.446+00:00",…
$ references <chr> "https://www.inaturalist.org/obs…
$ license <chr> "http://creativecommons.org/lice…
$ isSequenced <lgl> FALSE, FALSE, FALSE, FALSE, FALS…
$ identifier <chr> "257396416", "257489614", "25737…
$ facts <chr> "none", "none", "none", "none", …
$ relations <chr> "none", "none", "none", "none", …
$ isInCluster <lgl> FALSE, FALSE, FALSE, FALSE, FALS…
$ datasetName <chr> "iNaturalist research-grade obse…
$ recordedBy <chr> "Lioneska", "villllemo", "Václav…
$ identifiedBy <chr> "Lioneska", "villllemo", "Václav…
$ dnaSequenceID <chr> "none", "none", "none", "none", …
$ geodeticDatum <chr> "WGS84", "WGS84", "WGS84", "WGS8…
$ class <chr> "Mammalia", "Mammalia", "Mammali…
$ countryCode <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "C…
$ recordedByIDs <chr> "none", "none", "none", "none", …
$ identifiedByIDs <chr> "none", "none", "none", "none", …
$ gbifRegion <chr> "EUROPE", "EUROPE", "EUROPE", "E…
$ country <chr> "Czechia", "Czechia", "Czechia",…
$ publishedByGbifRegion <chr> "NORTH_AMERICA", "NORTH_AMERICA"…
$ rightsHolder <chr> "Lioneska", "villllemo", "Václav…
$ identifier.1 <chr> "257396416", "257489614", "25737…
$ http...unknown.org.nick <chr> "lioneska", "villllemo", "vaclav…
$ verbatimEventDate <chr> "2025/01/04 12:09", "2025-01-05 …
$ dynamicProperties <chr> "{\"evidenceOfPresence\":\"organ…
$ collectionCode <chr> "Observations", "Observations", …
$ gbifID <chr> "5006879567", "5007205681", "500…
$ verbatimLocality <chr> "Praha-Dubeč, Česko", "Praha 1, …
$ occurrenceID <chr> "https://www.inaturalist.org/obs…
$ taxonID <chr> "340942", "43997", "43997", "421…
$ http...unknown.org.captive_cultivated <chr> "wild", "wild", "wild", "wild", …
$ catalogNumber <chr> "257396416", "257489614", "25737…
$ institutionCode <chr> "iNaturalist", "iNaturalist", "i…
$ vitality <chr> "alive", NA, NA, NA, "alive", NA…
$ eventTime <chr> "12:09:00+01:00", "18:34:08+01:0…
$ identificationID <chr> "580286295", "580568163", "58021…
$ name <chr> "Ovis aries musimon (Pallas, 181…
$ iucnRedListCategory <chr> NA, "LC", "LC", "LC", "LC", "LC"…
$ projectId <chr> NA, NA, NA, NA, "https://www.ina…
$ informationWithheld <chr> NA, NA, NA, NA, NA, "Coordinate …
$ occurrenceRemarks <chr> NA, NA, NA, NA, NA, NA, "Pravděp…
$ distanceFromCentroidInMeters <dbl> NA, NA, NA, NA, NA, NA, NA, 3839…
$ recordedByIDs.type <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recordedByIDs.value <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identifiedByIDs.type <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identifiedByIDs.value <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ individualCount <int> NA, NA, NA, NA, NA, NA, NA, NA, …
$ samplingProtocol <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ vernacularName <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ habitat <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locality <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventType <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ dataGeneralizations <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ acceptedNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ type <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ datasetID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ language <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ accessRights <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recordNumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.taxonRankID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ taxonConceptID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ taxonRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ otherCatalogNumbers <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ associatedReferences <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ parentEventID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ gadm <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ associatedSequences <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ networkKeys <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ nameAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ coordinatePrecision <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferencedBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismQuantity <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismQuantityType <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ institutionKey <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ collectionKey <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ preparations <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ nomenclaturalCode <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ institutionID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ disposition <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ bibliographicCitation <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ collectionID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.language <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ footprintWKT <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.modified <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ originalNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ elevation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ elevationAccuracy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.b295f32e72712cfa6ea8a0c5effd02a0. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ fieldNumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherGeography <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferencedDate <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceProtocol <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimCoordinateSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ previousIdentifications <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherClassification <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceSources <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.ca2af10df069f7bb136331bbf56dff0f. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ ownerInstitutionCode <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ materialEntityID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ footprintSRS <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimIdentification <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.cad7ac2bf910bf964a341390bbf671ef. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.recordID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ county <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ rights <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.recordEnteredBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ municipality <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ establishmentMeans <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ acceptedNameUsageID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ parentNameUsage <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ island <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ materialSampleID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventRemarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ subfamily <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimElevation <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherGeographyID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEpochOrLowestSeries <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEpochOrHighestSeries <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ geologicalContextID <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestPeriodOrLowestSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEonOrLowestEonothem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEonOrHighestEonothem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEraOrLowestErathem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEraOrHighestErathem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestPeriodOrHighestSystem <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestAgeOrLowestStage <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ namePublishedInYear <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ lithostratigraphicTerms <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimTaxonRank <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationQualifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ bed <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ formation <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.7c6eb36e75647675f673eb64f47d6da7. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.66327737867e96c11e11981ee34015bd. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.007fcdadb6565d1ed1b72315ebd20ac6. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.d3cc8ac74f8baed616f2c84f07c7a233. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.1a97f58a971298c71c1476ba348f11e1. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.88b8c1395fa380ccc56dff961ff1cedd. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.22b9ea985ca6e925fd53129340878f10. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.8552d5f41be35d89cbb98e8237c026a7. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.86ddd89f8bcf5f519c417603edd234d6. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5064b328c03546024166c533a776ce16. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c751f5f745cc4d794c092ab21ff6bdf8. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c6d3b644819f7bb7007a779a9b8fceec. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.dcdc0c2d528cb18330ad4a56856db3ed. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.02a1149be87e9edb0aa497e31afe07a3. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.56f596244647c2dcb92fa607a36aa258. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.2183b1b175c0d4d4062a90b12ba78265. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.79d2ec723f5d3fb758fc5954ff5daaef. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5470bd0db78d65544c25963b4bcd7c95. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.77845812e5846e98ff4f49d9beb44e35. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.a9577f32baafa53e12b254567552af5e. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.bb817299a5a54aa91c408611dbb5122d. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.05ae6f6e29a31b3be928c05a8748e11f. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.97660a21f2435cce604c66c6f163e3dd. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.07fe054c296139a4c4b9cb56916226da. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.85b733a072a635dc519b3896f5cebe90. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.594b08c316417115e35553092ac68876. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5c54f89ad01da246db5b356befb2190b. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.bc5ff1ae9e2bb01bb3c5fa4f651ef8fb. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c0ef3a1800882fe27905a8fa3e523d4b. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5eeca0ece84ccaed09ebf47ffe1adb88. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.d33fccc638f772e2005023df63d4028a. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.469b3883081304b435895823d3abcfc7. <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ combinationAuthors <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimScientificName <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.verbatimLabel <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ combinationYear <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ canonicalName <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
Check the data output. How many rows and columns does it have?
Mammals occurrence records from the Czech Republic
How many records do we have?
Data are not ‘good’ or ‘bad’; the quality will depend on our goal.
Some things we can check:
CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner
Automated flagging of common spatial and temporal errors in data.
As an example of data cleaning procedures, we will check the following fields in our dataset:
basisOfRecord: we want preserved specimens or observations.
taxonRank: we want records at the species level.
coordinateUncertaintyInMeters: we want it to be smaller than 10 km.
basisOfRecord: we want preserved specimens or observations
# A tibble: 7 × 2
# Groups: basisOfRecord [7]
basisOfRecord n
<chr> <int>
1 FOSSIL_SPECIMEN 201
2 HUMAN_OBSERVATION 14146
3 MATERIAL_CITATION 206
4 MATERIAL_SAMPLE 105
5 OBSERVATION 22
6 OCCURRENCE 18
7 PRESERVED_SPECIMEN 984
group_by() is used to group values within a variable
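A summary table like the one above can be produced by grouping and counting. This is a minimal sketch on a toy table (the four records are invented), assuming dplyr is loaded; with the real data you would replace toy_records with mammalsCZ.

```r
library(dplyr)

# Toy stand-in for mammalsCZ with a single Darwin Core column
toy_records <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")
)

# Group the rows by basisOfRecord, then count the rows in each group
basis_counts <- toy_records %>%
  group_by(basisOfRecord) %>%
  count()

basis_counts
```

The result has one row per distinct basisOfRecord value, with the number of records in the column n.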
basisOfRecord: we want preserved specimens or observations
Update the object by filtering over the basisOfRecord to keep only records that correspond to “preserved specimens” or “human observations”.
Note the use of | (OR) to filter the data. An alternative is filter(basisOfRecord %in% c("PRESERVED_SPECIMEN","HUMAN_OBSERVATION")).
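The two ways of writing this filter can be compared on a toy table. This is a sketch with invented records, assuming dplyr is loaded; with the real data you would assign the result back to mammalsCZ.

```r
library(dplyr)

# Toy stand-in for mammalsCZ
toy_records <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "FOSSIL_SPECIMEN", "MATERIAL_SAMPLE")
)

# Version 1: two conditions joined with | (OR)
kept_with_or <- toy_records %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
           basisOfRecord == "HUMAN_OBSERVATION")

# Version 2: one condition with %in% and a vector of accepted values
kept_with_in <- toy_records %>%
  filter(basisOfRecord %in% c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"))
```

Both versions keep the same rows; the %in% form is easier to extend when the list of accepted values grows.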
taxonRank: we want records at the species level
Update the object by filtering over taxonRank to keep only records that correspond to the “species” level.
coordinateUncertaintyInMeters: we want them to be smaller than 10 km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters >= 10000) %>%
  select(scientificName,
         coordinateUncertaintyInMeters,
         stateProvince)
# A tibble: 426 × 3
scientificName coordinateUncertaint…¹ stateProvince
<chr> <dbl> <chr>
1 Rhinolophus hipposideros (Bechstein, 18… 26550 Jihomoravský
2 Myocastor coypus (Molina, 1782) 26550 Jihomoravský
3 Vulpes vulpes (Linnaeus, 1758) 26518 Plzeňský
4 Myocastor coypus (Molina, 1782) 26421 Prague
5 Myotis myotis (Borkhausen, 1797) 26389 Královéhrade…
6 Myotis emarginatus (E.Geoffroy, 1806) 26389 Královéhrade…
7 Capreolus capreolus (Linnaeus, 1758) 26389 Středočeský
8 Lutra lutra (Linnaeus, 1758) 26550 Kraj Vysočina
9 Barbastella barbastellus (Schreber, 177… 26550 Jihočeský
10 Lepus europaeus Pallas, 1778 26421 Prague
# ℹ 416 more rows
# ℹ abbreviated name: ¹coordinateUncertaintyInMeters
coordinateUncertaintyInMeters: we want them to be smaller than 10 km
Update the object by filtering over coordinateUncertaintyInMeters to keep only records that have less than 10000 meters of uncertainty.
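One detail worth knowing when filtering on coordinateUncertaintyInMeters is that filter() drops rows where the condition evaluates to NA, so records with no reported uncertainty are removed as well. The sketch below shows both behaviours on a toy table (values invented), assuming dplyr is loaded. Whether to keep records with unknown uncertainty depends on your goal.

```r
library(dplyr)

# Toy stand-in for mammalsCZ: one precise, one imprecise, one unknown
toy_records <- tibble(
  scientificName = c("Lutra lutra", "Vulpes vulpes", "Lepus europaeus"),
  coordinateUncertaintyInMeters = c(15, 26550, NA)
)

# Keeps only rows with a known uncertainty below 10 km (the NA row is dropped)
precise_only <- toy_records %>%
  filter(coordinateUncertaintyInMeters < 10000)

# Alternative: also keep rows where the uncertainty was not reported
precise_or_unknown <- toy_records %>%
  filter(coordinateUncertaintyInMeters < 10000 |
           is.na(coordinateUncertaintyInMeters))
```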
How are the records distributed?
We’ll get to this next week :)
And finally, a simple trick to produce separate maps per order.
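One common way to get a separate map per order is to facet the plot. The sketch below is an assumption about what such a trick could look like, not necessarily the one used in this practical. It uses three invented points, and assumes sf and ggplot2 are installed; with the real data you would start from mammalsCZ and add the country outline as a background layer.

```r
library(sf)
library(ggplot2)

# Toy stand-in for mammalsCZ with coordinates and taxonomic order
toy_records <- data.frame(
  order = c("Rodentia", "Rodentia", "Carnivora"),
  decimalLongitude = c(14.4, 17.3, 16.9),
  decimalLatitude = c(50.1, 49.6, 49.2)
)

# Turn the table into spatial points in WGS84 (EPSG:4326)
points_sf <- st_as_sf(
  toy_records,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

# facet_wrap() draws one panel (one small map) per order
maps_by_order <- ggplot(points_sf) +
  geom_sf() +
  facet_wrap(~ order)

# print(maps_by_order)  # uncomment to draw the maps
```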